9 research outputs found

    A Comprehensive Review of YOLO: From YOLOv1 and Beyond

    Full text link
    YOLO has become a central real-time object detection system for robotics, driverless cars, and video monitoring applications. We present a comprehensive analysis of YOLO's evolution, examining the innovations and contributions in each iteration from the original YOLO to YOLOv8 and YOLO-NAS. We start by describing the standard metrics and postprocessing; then, we discuss the major changes in network architecture and training tricks for each model. Finally, we summarize the essential lessons from YOLO's development and provide a perspective on its future, highlighting potential research directions to enhance real-time object detection systems. Comment: 31 pages, 15 figures, 4 tables, submitted to ACM Computing Surveys. This version includes YOLO-NAS and a more detailed description of YOLOv5 and YOLOv8. It also adds three new diagrams for the architectures of YOLOv5, YOLOv8, and YOLO-NAS.
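    The "standard metrics and postprocessing" the review refers to center on Intersection over Union (IoU) and non-maximum suppression (NMS). As a minimal sketch (function names are ours, not from the paper, and real detectors use vectorized implementations):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring boxes and
    drop any box that overlaps an already-kept box by more than `thresh` IoU."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.143
```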

    Loss Functions and Metrics in Deep Learning

    Full text link
    One of the essential components of deep learning is the choice of the loss function and performance metrics used to train and evaluate models. This paper reviews the most prevalent loss functions and performance measurements in deep learning. We examine the benefits and limits of each technique and illustrate their application to various deep-learning problems. Our review aims to give a comprehensive picture of the different loss functions and performance indicators used in the most common deep learning tasks and help practitioners choose the best method for their specific task. Comment: 53 pages, 5 figures, 7 tables, 86 equations.
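    Two of the losses such a review almost certainly covers are mean squared error (regression) and categorical cross-entropy (classification). A minimal NumPy sketch of both, for illustration only (the paper's own definitions may differ in normalization or conventions):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the standard regression loss."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Categorical cross-entropy for one-hot targets and predicted
    class probabilities (rows of y_prob sum to 1); eps avoids log(0)."""
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))
```

For a one-hot target `[1, 0]` and a maximally uncertain prediction `[0.5, 0.5]`, the cross-entropy is ln 2, which matches the intuition that the model has one bit of uncertainty about the class.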

    Calibration of a Multi-Camera Kinect V2 System

    Get PDF
    In this paper, we propose a method to easily calibrate multiple Kinect V2 sensors. It requires the cameras to simultaneously observe a 1D object shown at different orientations (three at least) or a 2D object for at least one acquisition. This is possible due to the built-in coordinate mapping capabilities of the Kinect. Our method follows five steps: image acquisition, pre-calibration, point cloud matching, intrinsic parameters initialization, and final calibration. We modeled the radial and distortion parameters of all the cameras, obtaining a root mean square re-projection error of 0.2 pixels on the depth cameras and 0.4 pixels on the color cameras. To validate the calibration results, we performed point cloud fusion with color and 3D reconstruction using the depth and color information from four Kinect sensors.
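    The root-mean-square re-projection error the authors report (0.2 and 0.4 pixels) is a standard calibration quality measure: project the calibration points through the estimated camera model and compare against where they were actually observed. A generic sketch, not the authors' code:

```python
import numpy as np

def rms_reprojection_error(observed, reprojected):
    """RMS of Euclidean distances between observed image points and points
    re-projected through the estimated camera model (both N x 2 arrays, pixels)."""
    d = np.linalg.norm(observed - reprojected, axis=1)
    return np.sqrt(np.mean(d ** 2))
```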

    Assistive wearable technology for dyadic interactions of visually impaired people

    No full text
    Thesis (Doctorate in Advanced Technology), Instituto Politécnico Nacional, CICATA, Unidad Querétaro, 2016, 1 PDF file (114 pages). tesis.ipn.m

    Three-Dimensional Reconstruction of Indoor and Outdoor Environments Using a Stereo Catadioptric System

    No full text
    In this work, we present a panoramic 3D stereo reconstruction system composed of two catadioptric cameras. Each one consists of a CCD camera and a parabolic convex mirror that allows the acquisition of catadioptric images. We describe the calibration approach and propose the improvement of existing deep feature matching methods with epipolar constraints. We show that the improved matching algorithm covers more of the scene than classic feature detectors, yielding broader and denser reconstructions for outdoor environments. Our system can also generate accurate measurements in the wild without the large amounts of data that deep learning-based systems require. We demonstrate the system's feasibility and effectiveness as a practical stereo sensor with real experiments in indoor and outdoor environments.
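    Filtering putative matches with an epipolar constraint, as the abstract proposes, means keeping only point pairs consistent with the cameras' relative geometry. A pinhole-style sketch (the paper's catadioptric epipolar geometry is more involved; names and the tolerance are ours): a match (x1, x2) is accepted when the residual |x2ᵀ F x1| is near zero for the fundamental matrix F.

```python
import numpy as np

def epipolar_residual(F, pt1, pt2):
    """Residual |x2^T F x1| for a putative match (image points in pixels,
    lifted to homogeneous coordinates)."""
    x1 = np.array([pt1[0], pt1[1], 1.0])
    x2 = np.array([pt2[0], pt2[1], 1.0])
    return abs(x2 @ F @ x1)

def filter_matches(F, pts1, pts2, tol=1e-3):
    """Keep the indices of matches consistent with the fundamental matrix F."""
    return [i for i, (p, q) in enumerate(zip(pts1, pts2))
            if epipolar_residual(F, p, q) < tol]
```

For rectified stereo (pure horizontal translation), F reduces to the skew-symmetric matrix of (1, 0, 0), and the constraint simply says matched points must lie on the same row.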

    Automatic Recognition of Mexican Sign Language Using a Depth Camera and Recurrent Neural Networks

    No full text
    Automatic sign language recognition is a challenging task in machine learning and computer vision. Most works have focused on recognizing sign language using hand gestures only. However, body motion and facial gestures play an essential role in sign language interaction. Taking this into account, we introduce an automatic sign language recognition system based on multiple gestures, including hands, body, and face. We used a depth camera (OAK-D) to obtain the 3D coordinates of the motions and recurrent neural networks for classification. We compare multiple model architectures based on recurrent networks such as Long Short-Term Memories (LSTM) and Gated Recurrent Units (GRU) and develop a noise-robust approach. For this work, we collected a dataset of 3000 samples from 30 different signs of the Mexican Sign Language (MSL) containing 3D feature coordinates from the face, body, and hands. After extensive evaluation and ablation studies, our best model obtained an accuracy of 97% on clean test data and 90% on highly noisy data.
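    The GRU cells compared in this abstract process a sequence of per-frame feature vectors by repeatedly gating a hidden state. A single update step in bare NumPy (biases omitted for brevity; the actual models would stack such cells and add a classification head, and the paper's exact architecture is not shown here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: gated update of hidden state h given input frame x.
    W* act on the input, U* on the previous hidden state."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate: how much to overwrite
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate: how much history to use
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde
```

Running this step over all frames of a sign and feeding the final hidden state to a softmax classifier is the standard recipe for sequence classification with recurrent networks.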
